1. Introduction & Motivation

“Forecasting” is always the most increasing topic in the house-price market and has gained importance and popularity among the general public and companies of all sizes due to its economic benefits and low risk. At the same time, the real estate industry accounts for a large part of GPA in the United States, which can directly influence citizens’ consumption level as well as the overall domestic economy[1]. Zillow is one of the most popular tech real estate companies founded in 2006. They have realized that its housing market predictions are not so accurate as the estimation is very complex as it involves a wide variety of attributes, like crime rate, park number, land use type and so on. Accurate house price forecasts are not only beneficial to investors and consumers, but are also the basis for stable economic development.

So as for this project, our aim to bring geospatial analysis and machine learning method together to generate our model. Through Ordinary Least Squares (OLS) linear regression to estimate the housing price in Mecklenberg County, NC.

The project has encountered difficulties and has some limitations. As we all know, house prices are an integrated and diverse result related to the internal characteristics of the house itself, amenities, public features, the spatial structure of the city, and even to the development experience the United States. Many geospatial factors cannot be quantified,which will lead to inaccuracy. The acquisition and pre-processing of data from different sources and formats is also a challenge, and possible errors in the data will propagate and accumulate. What’s more, the OLS model is not so appropriate for house price prediction.

Our main strategy is to use many factors that may have an impact on house prices, combine with OLS to train and test our model. The steps are divided into four parts: Data Wrangling; Exploratory Analysis; Feature Engineering and Selection; Model Estimation and Validation. The details are in the Section 3.

Our result model is based on 39 features. Within the 27.38% price prediction accuracy range (MAPE of 0.27), it explains about 69% of the variation in Mecklenberg house prices (adjusted R-squared of 0.69). The model’s Moran’s I shows that there is still some degree of spatial autocorrelation in the errors (Moran’s I of 0.19). We divided the study area into four groups. The results show that there is no significant difference in race, income, percentage of vacant and renter occupied, meaning that our model is conditional generalizable. To a certain extent, we can believe that our OLS model is possible to work as a reference for Zillow to predict house price.

knitr::opts_chunk$set(echo = TRUE)
options(scipen=999)
options(tigris_class = "sf")

# Load some libraries
library(factoextra)
library(FactoMineR)
library(tidyverse)
library(tidycensus)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot) # plot correlation plot
library(corrplot)
library(corrr)      # another way to plot correlation plot
library(kableExtra)
library(jtools)     # for regression model plots
library(ggstance) # to support jtools plots
library(ggpubr)    # plotting R^2 value on ggplot point scatter
library(broom.mixed) # needed for effects plots
library(stargazer)

source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

palette5 <- c("#25CB10", "#5AB60C", "#8FA108",   "#C48C04", "#FA7800")

census_api_key("155cc525674a0d27c98afbb7030d0802e9bb5543", overwrite = TRUE)

2. Data Manipulation and Visualization

2.1 Data Collection

According to the hedonic model, house prices are affected by three constituent parts: 1)physical characteristics; 2)public services or amenities; 3)the spatial process of prices. Therefore, we gather data from those three aspects.

Physical characteristics are the intrinsic characteristics of houses, like the number of bedrooms of the house or the heated area. We choose 8 physical characteristics in total like heated area, full bathrooms, bldggrade, half bathrooms, bedrooms, house size, Area, house age as features to predict the price, all of them are from studentData.geojson on the github.

Public services or amenities are like crimes or schools, we choose 19 characteristics of this part, and percentage of people with bachelor’s degree or more, percentage of poverty, percentage of white residents, vacant lands, population density, commercial construction, job density, financial services, impervious surface, house density, single family, violent crime, public transportation, park, library, shopping center, schools, medical places are used to predict house price. Those data are from both ACS and Mecklenberg County’s Open Data portal.

We also create some new features from existing data by using KNN. The new feature are 5 nearest neighbors of parks, libraries, shopping centers, schools, and medical places.

Mecklenburg.sf <-
  st_read("./data/studentData/studentData.geojson") %>%
  dplyr::select(lot_num,totalac, municipali, price, yearbuilt,city, heatedarea, storyheigh, aheatingty, heatedfuel, actype, numfirepla, bldggrade, fullbaths, halfbaths, bedrooms, units, accounttyp, landusecod, shape_Area, toPredict, musaID) %>%
  mutate(house_age = 2022 - yearbuilt,
         pricepersq=price/shape_Area) %>%
  st_transform('EPSG:32119') %>%
  na.omit()

Mecklenburg.challenge <-
  Mecklenburg.sf[Mecklenburg.sf$toPredict == "CHALLENGE",]

Mecklenburg.tracts20 <- 
  get_acs(geography = "tract", variables = c("B25026_001E","B02001_002E",
                                             "B19013_001E","B15001_009",
                                             "B06012_002E", "B15001_050",
                                             # vacant variables
                                             "B25002_003E", 
                                             "B25004_002E","B25004_003E",
                                             "B25004_004E","B25004_005E",
                                             # total housing unit
                                             "B25001_001E",
                                             # renter occupied 
                                             'B08137_003E'), 
          year=2020, state= "NC", county= "Mecklenburg", geometry=T, output="wide") %>%
  st_transform('EPSG:32119') %>%
  rename(TotalPop = B25026_001E, 
         Whites = B02001_002E,
         FemaleBachelors = B15001_050E,
         MaleBachelors = B15001_009E,
         MedHHInc = B19013_001E, 
         TotalPoverty = B06012_002E,
         TotalVacant = B25002_003E,
         ForRent = B25004_002E,
         ForRentVac = B25004_003E,
         ForSale = B25004_004E,
         ForSaleVac = B25004_005E,
         TotalUnit = B25001_001E,
         RenterOccupied = B08137_003E) %>%
  dplyr::select(-NAME, -starts_with("B")) %>%
  mutate(pctWhite = ifelse(TotalPop > 0, Whites / TotalPop, 0),
         pctBachelors = ifelse(TotalPop > 0, ((FemaleBachelors + MaleBachelors) / TotalPop), 0),
         pctPoverty = ifelse(TotalPop > 0, TotalPoverty / TotalPop, 0),
         pctTotalVacant = ifelse(TotalUnit > 0, TotalVacant / TotalUnit * 100, 0),
         TotalOccupied = TotalUnit - TotalVacant,
         pctRenterOccupied = ifelse(TotalOccupied >0, RenterOccupied/TotalOccupied, 0)) %>%
  dplyr::select(-Whites, -FemaleBachelors, -MaleBachelors, -TotalPoverty)

Mecklenburg.neighborhood <-
  st_read("./data/Quality of Life Explorer/Character/Area.geojson") %>%
  st_transform('EPSG:32119')

## Join studentData with ACS data
Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.tracts20) 

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.neighborhood)

## External Data (Quality of Life)

## 1. Character

Mecklenburg.vacant_land <-
  st_read("./data/Quality of Life Explorer/Character/Vacant Land.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(X2021.raw) %>%
  rename(vacant_land = X2021.raw) %>%
  na.omit()

Mecklenburg.race_AF <-
  st_read("./data/Quality of Life Explorer/Character/Race_Ethnicity - Black or African American.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(X2020) %>%
  rename(race_AF = X2020) %>%
  na.omit()

Mecklenburg.age_Res <-
  st_read("./data/Quality of Life Explorer/Character/Age of Residents.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(X2020) %>%
  rename(age_Res = X2020) %>%
  na.omit()

Mecklenburg.older_pop <-
  st_read("./data/Quality of Life Explorer/Character/Population - Older Adult.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(X2020) %>%
  rename(older_pop = X2020) %>%
  na.omit()

Mecklenburg.pop_density <-
  st_read("./data/Quality of Life Explorer/Character/Population Density.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(X2020.raw) %>%
  rename(pop_density = X2020.raw) %>%
  na.omit()


## 2. Economy
Mecklenburg.employment <-
  st_read("./data/Quality of Life Explorer/Economy/Employment.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(X2020) %>%
  rename(employment = X2020) %>%
  na.omit()

Mecklenburg.commercial_construction <-
  st_read("./data/Quality of Life Explorer/Economy/Commercial Construction.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(X2020.raw) %>%
  rename(commercial_construction = X2020.raw) %>%
  na.omit()

Mecklenburg.job_density <-
  st_read("./data/Quality of Life Explorer/Economy/Job Density.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(X2019.raw) %>%
  rename(job_density = X2019.raw) %>%
  na.omit()

Mecklenburg.financial_services <-
  st_read("./data/Quality of Life Explorer/Economy/Proximity to Financial Services.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(X2020.raw) %>%
  rename(financial_services = X2020.raw) %>%
  na.omit()

Mecklenburg.food_service <-
  st_read("./data/Quality of Life Explorer/Economy/Food and Nutrition Services.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(X2020) %>%
  rename(food_service = X2020)%>%
  na.omit()


## 3. Education

Mecklenburg.early_edu <-
  st_read("./data/Quality of Life Explorer/Education/Proximity to Early Care and Education.geojson") %>%
  st_transform('EPSG:32119')%>%
  dplyr::select(X2020.raw) %>%
  rename(early_edu = X2020.raw) %>%
  na.omit()

## 4. Environment

Mecklenburg.drive_alone <-
  st_read("./data/Quality of Life Explorer/Environment/Commuters Driving Alone.geojson") %>%
  st_transform('EPSG:32119')%>%
  dplyr::select(X2020.raw) %>%
  rename(drive_alone = X2020.raw) %>%
  na.omit()

Mecklenburg.impervious_surface <-
  st_read("./data/Quality of Life Explorer/Environment/Impervious Surface.geojson") %>%
  st_transform('EPSG:32119')%>%
  dplyr::select(X2021.raw) %>%
  rename(impervious_surface = X2021.raw) %>%
  na.omit()

Mecklenburg.gas_consumption <-
  st_read("./data/Quality of Life Explorer/Environment/Energy Consumption - Natural Gas.geojson") %>%
  st_transform('EPSG:32119')%>%
  dplyr::select(X2013.raw) %>%
  rename(gas_consumption = X2013.raw) %>%
  na.omit()

Mecklenburg.tree_res <-
  st_read("./data/Quality of Life Explorer/Environment/Tree Canopy - Residential.geojson") %>%
  st_transform('EPSG:32119')%>%
  dplyr::select(X2012.raw) %>%
  rename(tree_res = X2012.raw) %>%
  na.omit()

## 5. Health
Mecklenburg.recreation <-
  st_read("./data/Quality of Life Explorer/Health/Proximity to Public Outdoor Recreation.geojson") %>%
  st_transform('EPSG:32119')%>%
  dplyr::select(X2020.raw) %>%
  rename(recreation = X2020.raw) %>%
  na.omit()

Mecklenburg.age_death <-
  st_read("./data/Quality of Life Explorer/Health/Age of Death.geojson") %>%
  st_transform('EPSG:32119')%>%
  dplyr::select(X2019) %>%
  rename(age_death = X2019) %>%
  na.omit()

## 6. Housing

Mecklenburg.house_size <-
  st_read("./data/Quality of Life Explorer/Housing/Housing Size.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(X2020) %>%
  rename(house_size = X2020)%>%
  na.omit()

Mecklenburg.house_density <-
  st_read("./data/Quality of Life Explorer/Housing/Housing Density.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(X2020.raw) %>%
  rename(house_density = X2020.raw)%>%
  na.omit()

Mecklenburg.residential_occupancy <-
  st_read("./data/Quality of Life Explorer/Housing/Residential Occupancy.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(X2020) %>%
  rename(residential_occupancy = X2020)%>%
  na.omit()

Mecklenburg.single_family <-
  st_read("./data/Quality of Life Explorer/Housing/Single-Family Housing.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(X2020.raw) %>%
  rename(single_family = X2020.raw)%>%
  na.omit()


## 7. Safety

Mecklenburg.violent_crime <-
  st_read("./data/Quality of Life Explorer/Safety/Crime - Violent.geojson") %>%
  st_transform('EPSG:32119')%>%
  dplyr::select(X2020.raw) %>%
  rename(violent_crime = X2020.raw)%>%
  na.omit()

## 8. Transportation
Mecklenburg.public_transportation <-
  st_read("./data/Quality of Life Explorer/Transportation/Proximity to Public Transportation.geojson") %>%
  st_transform('EPSG:32119')%>%
  dplyr::select(X2020.raw) %>%
  rename(public_transportation = X2020.raw)

## Points Data

Mecklenburg.park <-
  st_read("./data/parks.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(geometry)

Mecklenburg.library <-
  st_read("./data/Libraries.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(geometry)

Mecklenburg.medical <-
  st_read("./data/Medical_Facilities.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(geometry)

Mecklenburg.shopping <-
  st_read("./data/Existing_Shopping_Centers.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(geometry)

Mecklenburg.schools <-
  st_read("./data/Schools.geojson") %>%
  st_transform('EPSG:32119') %>%
  dplyr::select(geometry)



## Create Nearest Neighbor Feature

## 1. Character

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.vacant_land) %>%
  na.omit()


# Mecklenburg.sf <-
#   st_join(Mecklenburg.sf, Mecklenburg.older_pop)%>%
#   na.omit()

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.pop_density)%>%
  na.omit()

# Mecklenburg.sf <-
#   st_join(Mecklenburg.sf, Mecklenburg.age_Res)%>%
#   na.omit()


Mecklenburg.challenge <-
  Mecklenburg.sf[Mecklenburg.sf$toPredict == "CHALLENGE",]

## 2. Economy

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.employment) %>%
  na.omit()

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.commercial_construction)%>%
  na.omit()

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.job_density)%>%
  na.omit()

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.financial_services)%>%
  na.omit()

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.food_service)%>%
  na.omit()

## 3. Education

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.early_edu)%>%
  na.omit()

## 4. Environment

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.impervious_surface)%>%
  na.omit()

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.drive_alone)%>%
  na.omit()

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.gas_consumption)%>%
  na.omit()

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.tree_res)%>%
  na.omit()

## 5. Health

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.recreation)%>%
  na.omit()

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.age_death)%>%
  na.omit()

## 6. Housing

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.house_size)%>%
  na.omit()

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.house_density)%>%
  na.omit()


Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.single_family)%>%
  na.omit()

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.residential_occupancy)%>%
  na.omit()

## 7. Safety

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.violent_crime)%>%
  na.omit()

## 8. Transportation

Mecklenburg.sf <-
  st_join(Mecklenburg.sf, Mecklenburg.public_transportation)%>%
  na.omit()

## Points Data


## Park

Mecklenburg.sf <-
  Mecklenburg.sf %>%
  mutate(
    park_nn1 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.park)), 1),
    park_nn2 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.park)), 2),
    park_nn3 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.park)), 3),
    park_nn4 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.park)), 4),
    park_nn5 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.park)), 5))%>%
  na.omit()

## Library

Mecklenburg.sf <-
  Mecklenburg.sf %>%
  mutate(
    library_nn1 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.library)), 1),
    library_nn2 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.library)), 2),
    library_nn3 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.library)), 3),
    library_nn4 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.library)), 4),
    library_nn5 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.library)), 5))%>%
  na.omit()

## Shopping center

Mecklenburg.sf <-
  Mecklenburg.sf %>%
  mutate(
    shopping_nn1 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.shopping)), 1),
    shopping_nn2 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.shopping)), 2),
    shopping_nn3 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.shopping)), 3),
    shopping_nn4 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.shopping)), 4),
    shopping_nn5 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.shopping)), 5))%>%
  na.omit()

## Schools

Mecklenburg.sf <-
  Mecklenburg.sf %>%
  mutate(
    school_nn1 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.schools)), 1),
    school_nn2 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.schools)), 2),
    school_nn3 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.schools)), 3),
    school_nn4 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.schools)), 4),
    school_nn5 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.schools)), 5))%>%
  na.omit()

## Crime

# Mecklenburg.sf <-
#   Mecklenburg.sf %>%
#   mutate(
#     crime_nn1 =
#       nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.crime)), 1),
#     crime_nn2 =
#       nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.crime)), 2),
#     crime_nn3 =
#       nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.crime)), 3),
#     crime_nn4 =
#       nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.crime)), 4),
#     crime_nn5 =
#       nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.crime)), 5))

## Medical

Mecklenburg.sf <-
  Mecklenburg.sf %>%
  mutate(
    medical_nn1 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.medical)), 1),
    medical_nn2 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.medical)), 2),
    medical_nn3 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.medical)), 3),
    medical_nn4 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.medical)), 4),
    medical_nn5 =
      nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.medical)), 5))%>%
  na.omit()

2.2 Summary Statistics

This table statistics some features of independent variables.

Mecklenburg.select_data <-
  Mecklenburg.sf %>%
  select(price,heatedarea, fullbaths, bldggrade,  halfbaths, bedrooms, shape_Area, house_age, MedHHInc, TotalVacant, TotalUnit,  pctWhite, pctPoverty, vacant_land,pop_density, commercial_construction, job_density, financial_services, impervious_surface, house_size,house_density, single_family,  violent_crime, public_transportation, park_nn2,park_nn3,park_nn5, library_nn1,library_nn2,library_nn3,library_nn4,library_nn5, shopping_nn1, shopping_nn2,shopping_nn3, school_nn2,school_nn3, medical_nn3,medical_nn4,medical_nn5)%>%
  st_drop_geometry()

internal.feature.list <- Mecklenburg.select_data %>%
  dplyr:: select(heatedarea, fullbaths, bldggrade,  halfbaths, bedrooms, shape_Area, house_age, house_size,house_density) 

stargazer(internal.feature.list, type = "html", 
          title = "Table 2. Summary Statistics of Internal Characteristics",
          header = FALSE,
          single.row = TRUE)
Table 2. Summary Statistics of Internal Characteristics
Statistic N Mean St. Dev. Min Max
heatedarea 45,341 2,362.307 1,061.263 0.000 14,710.000
fullbaths 45,341 2.283 0.827 0 9
halfbaths 45,341 0.595 0.528 0 4
bedrooms 45,341 3.510 0.944 0 65
shape_Area 45,341 15,933.740 35,219.760 1,139.637 3,486,865.000
house_age 45,341 28.652 24.051 0 194
house_size 45,341 2,281.600 730.491 1,005.000 5,494.000
house_density 45,341 1,308.028 910.955 69.100 11,163.000
piblic.feature.list <- Mecklenburg.select_data %>%
  dplyr:: select(financial_services, public_transportation, park_nn2,park_nn3,park_nn5, library_nn1,library_nn2,library_nn3,library_nn4,library_nn5, shopping_nn1, shopping_nn2,shopping_nn3, school_nn2,school_nn3, medical_nn3,medical_nn4,medical_nn5) 

stargazer(piblic.feature.list, type = "html", 
          title = "Table 2. Summary Statistics of Public Services Characteristics",
          header = FALSE,
          single.row = TRUE)
Table 2. Summary Statistics of Public Services Characteristics
Statistic N Mean St. Dev. Min Max
financial_services 45,341 363.688 786.328 0 9,462
public_transportation 45,341 703.682 942.436 0 11,641
park_nn2 45,341 1,513.375 790.582 63.144 5,064.762
park_nn3 45,341 1,738.650 819.041 90.029 5,653.027
park_nn5 45,341 2,147.716 880.371 184.989 6,754.394
library_nn1 45,341 4,026.632 2,238.726 73.891 10,454.780
library_nn2 45,341 5,452.666 2,569.007 333.465 13,914.410
library_nn3 45,341 6,379.752 2,914.899 635.328 16,460.460
library_nn4 45,341 7,066.235 3,112.351 886.952 17,780.290
library_nn5 45,341 7,833.089 3,229.467 1,017.134 18,692.330
shopping_nn1 45,341 1,774.212 1,306.096 43.122 9,488.096
shopping_nn2 45,341 2,023.224 1,314.477 113.299 9,689.564
shopping_nn3 45,341 2,232.628 1,318.495 147.817 9,789.222
school_nn2 45,341 1,534.650 857.392 97.415 6,752.533
school_nn3 45,341 1,803.053 939.254 249.679 7,589.063
medical_nn3 45,341 2,046.510 1,365.741 45.589 8,104.070
medical_nn4 45,341 2,164.763 1,388.329 48.756 8,406.003
medical_nn5 45,341 2,269.014 1,408.529 52.703 8,627.653
spatial.feature.list <- Mecklenburg.select_data %>%
  dplyr:: select(MedHHInc, TotalVacant, TotalUnit,  pctWhite, pctPoverty, vacant_land,pop_density, commercial_construction, job_density, impervious_surface,  single_family,violent_crime) 

stargazer(spatial.feature.list, type = "html", 
          title = "Table 2. Summary Statistics of Spatial Structure Characteristics",
          header = FALSE,
          single.row = TRUE)
Table 2. Summary Statistics of Spatial Structure Characteristics
Statistic N Mean St. Dev. Min Max
MedHHInc 45,341 86,346.610 37,587.750 17,685 238,750
TotalVacant 45,341 95.336 75.152 0 405
TotalUnit 45,341 1,606.760 464.513 249 2,778
pctWhite 45,341 0.571 0.272 0.011 1.710
pctPoverty 45,341 0.090 0.082 0.000 0.614
vacant_land 45,341 243.841 393.899 0.000 1,755.900
pop_density 45,341 3,373.346 2,032.687 485.000 16,580.900
commercial_construction 45,341 8.667 18.127 0 485
job_density 45,341 1,322.237 3,911.261 4.000 96,789.000
impervious_surface 45,341 147.362 104.904 19.000 706.700
single_family 45,341 953.650 585.037 2 3,649
violent_crime 45,341 9.730 15.200 0 198

2.3 Correlation Matrix

This table reflects the correlation between each variables. It can help us to find variables which have strong correlations between each other. In that case, we can reduce redundant features.

numericVars <- 
  select_if(st_drop_geometry(Mecklenburg.select_data), is.numeric) %>% na.omit()

M <- cor(numericVars,use="pairwise.complete.obs")
corrplot(M, method = "square",type = "lower",tl.cex = 0.5)

2.4 Scatter plots of Four Interesting Predictors and Sale Price

The first two plots shows the correlations between price and house intrinsic characteristics, which shows strong correlations. It means the intrinsic characteristics of houses are important for price prediction. To our surprise, the school feature doesn’t show a strong correlation to the price which is somewhat against common sense. For the plot of pctWhite which represents the correlation of percentage of white people and price, it reflects some outliers. This plot can help us to find these outliers and remove them.

st_drop_geometry(Mecklenburg.sf) %>% 
  dplyr::select(price, heatedarea,pctWhite,school_nn2,house_size) %>%
  filter(price <= 10000000) %>%
  gather(Variable, Value,-price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     facet_wrap(~Variable, ncol = 2, scales = "free") +
     labs(title = "Price as a function of continuous variables") +
     plotTheme()

2.5 Map of House Price

This map shows the spatial difference of house sale price.

new_palette5 = c("#EFFFFD", "#B8FFF9", "#85F4FF", "#42C2FF", "#03045E")

Mecklenburg.modelling <-
  Mecklenburg.sf[Mecklenburg.sf$toPredict == "MODELLING",] %>%
  filter(price < 10000000) %>%
  filter(price > 0) %>%
  na.omit()

ggplot() +
  geom_sf(data=st_union(Mecklenburg.tracts20)) +
  geom_sf(data = Mecklenburg.sf, aes(color = q5(price)), show.legend = "point",size = 0.8) +
  scale_color_manual(values = alpha(new_palette5, .6),
                     labels = qBr(Mecklenburg.modelling,'price'),
                     name = "Price, $") +
  labs(title = "Map of House Price in Mecklenburg\n") +
  mapTheme() +
  plotTheme()

2.6 Maps of Three Independent Variables

Similar to the map of house price, these three maps reflects spatial distributions of heated area, house area, and violent crime. From the plot we can find, the most southern and northern parts has better heat conditions, larger house area, and less violent crimes. The middle area is diametrically opposed. The plot reflects spatial difference of the features which is helpful for us to understand the characteristics of each region and choose more useful features for the model.

ggplot() +
  geom_sf(data=st_union(Mecklenburg.tracts20)) +
  geom_sf(data = Mecklenburg.sf, aes(color = q5(heatedarea)), show.legend = "point",size = 0.8) +
  scale_color_manual(values = alpha(new_palette5, .6),
                     labels = qBr(Mecklenburg.sf,'heatedarea'),
                     name = "Heated Area") +
  labs(title = "Map of Heated Area in Mecklenburg\n") +
  mapTheme() +
  plotTheme()

ggplot() +
  geom_sf(data=st_union(Mecklenburg.tracts20)) +
  geom_sf(data = Mecklenburg.sf, aes(color = q5(shape_Area)), show.legend = "point",size = 0.8) +
  scale_color_manual(values = alpha(new_palette5, .6),
                     labels = qBr(Mecklenburg.sf,'shape_Area'),
                     name = "Heated Area") +
  labs(title = "Map of House Area in Mecklenburg\n") +
  mapTheme() +
  plotTheme()

ggplot() +
  geom_sf(data=st_union(Mecklenburg.tracts20)) +
  geom_sf(data = Mecklenburg.sf, aes(color = q5(violent_crime)), show.legend = "point",size = 0.8) +
  scale_color_manual(values = alpha(new_palette5, .6),
                     labels = qBr(Mecklenburg.sf,'violent_crime'),
                     name = "Heated Area") +
  labs(title = "Map of violent crime in Mecklenburg\n") +
  mapTheme() +
  plotTheme()

2.7 Some interesting maps

The first plot shows each feature’s contribution to the PCA, which can reflect which feature is more important in some extent.

The rest plots are heated maps to reflect density of certain feature’s spatial distribution.

Mecklenburg.numeric <- Mecklenburg.sf[, unlist(lapply(Mecklenburg.sf,
                                   is.numeric))]
  
Mecklenburg.numeric_nogeo<- Mecklenburg.numeric %>% st_drop_geometry()

Mecklenburg.pca <- prcomp(Mecklenburg.numeric_nogeo, center = TRUE,scale. = TRUE)

Mecklenburg.pca_plot2<-fviz_contrib(Mecklenburg.pca, choice = "var", axes = 1:2)
Mecklenburg.pca_plot2 + 
  theme(axis.text = element_text(size = 10),
        axis.text.x = element_text(size = 9, angle = 55))

ggplot() + 
  geom_sf(data = st_union(Mecklenburg.tracts20)) +
  stat_density2d(data = data.frame(st_coordinates(Mecklenburg.public_transportation)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.002, bins = 30, geom = 'polygon') +
  scale_fill_gradient(low = "#c4fff9", high = "#07beb8", name = "public_transportation") +
  scale_alpha(range = c(0.00, 0.2), guide = FALSE) +
  labs(title = "Density of recreations, Mecklenburg") +
  mapTheme()

ggplot() + geom_sf(data = st_union(Mecklenburg.tracts20)) +
  stat_density2d(data = data.frame(st_coordinates(Mecklenburg.shopping)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.002, bins = 30, geom = 'polygon') +
  scale_fill_gradient(low = "#c4fff9", high = "#07beb8", name = "shopping") +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of schools, Mecklenburg") +
  mapTheme()

ggplot() + geom_sf(data = st_union(Mecklenburg.tracts20)) +
  stat_density2d(data = data.frame(st_coordinates(Mecklenburg.job_density)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.002, bins = 30, geom = 'polygon') +
  scale_fill_gradient(low = "#c4fff9", high = "#07beb8", name = "Job") +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of jobs, Mecklenburg") +
  mapTheme()

3. Methods

For this project, we aim to train a model to predict the house price as accurately as possible and we will combine geospatial analytics and machine learning method (OLS) together to generate our model. The steps are divided into four sections:

3.1 Data Wrangling

Data wrangling is the first step to clean all the data we collected. We import all the data and set them to the same coordinate and format so that subsequent operations can be performed. In addition, we need to remove some abnormal data like NA. Then, merge all the data to one dataset according to the geographic location relationship and separate them into two parts, training data and testing data, normally it’s 80%:20%. So we complete the preparation.

3.2 Exploratory Analysis

Exploratory analysis helps us get a better a glimpse of data. It is a good approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.

3.3 Feature Engineering and Selection

The feature engineering and selection is done in the following steps:

  • In order to choose the most effective features, we have put all the features into the OLS model to test the fitness firstly, then summarized the model’s performance, the output gives the p-value and the coefficient for each variable. P-value is the probability, which reflects the likelihood of an event occurring. In other words, it reflects the probability of having the coefficient assuming there is no correlation between the independent and the dependent variables. In general, the p-value should be smaller than 0.05. R-squared is also calcutated, which is a goodness-of-fit measure for linear regression models. It measures the strength of the relationship between your linear model and the dependent variables on a 0 - 100% scale, which means how close the data are to the fitted regression line. In general, the higher the R-squared, the better the model fits your data.

  • We checked each p-value of the variables, which tell us what variables to keep or remove. After processing all the data, we ran OLS model again to compare the R-squared value. We have done multiple trials to select the best set of predictors.

3.4 Model Estimation and Validation

After the final model is obtained, We have evaluated the model’s performance by R-Squared, MAE, MAPE and also conducted cross-validation. Moran’s I is calculated to assess the spatial autocorrelation of the residuals and to assess the generalizability of the model using race, income, income, percentage of vacant and renter occupied data. Zillow manager can visualize the spatial autocorrelation results of the model through Moran’s I and assess the generalizability based on tables of different background data.

4. Results

4.1 Dataset Splitting and Model Building

First of all, We separated the Modelling data and the Challenge data according to “toPredict” field, then divided the Modelling Data set into training data and test data, the ratio is 8:2.

set.seed(56235)

Mecklenburg.sf <-
  Mecklenburg.sf %>%
  na.omit()

Mecklenburg.modelling <-
  Mecklenburg.sf[Mecklenburg.sf$toPredict == "MODELLING",] %>%
  filter(price < 10000000) %>%
  filter(price > 0)

Train <-
  createDataPartition(y = paste(Mecklenburg.modelling$lot_num), p=0.8, list=F)

Mecklenburg.train <-
  Mecklenburg.sf[Train,] %>%
  filter(price < 10000000) %>%
  filter(price > 0) %>%
  na.omit()

Mecklenburg.test <-
  Mecklenburg.sf[-Train,] %>%
  filter(price < 10000000) %>%
  filter(price > 0)%>%
  na.omit()

Mecklenburg.challenge <-
  Mecklenburg.sf[Mecklenburg.sf$toPredict == "CHALLENGE",]
Model1 <- lm(price ~ ., data = as.data.frame(Mecklenburg.train) %>%
               dplyr::select(price,heatedarea, fullbaths, bldggrade,  halfbaths, bedrooms, shape_Area, house_age, MedHHInc, TotalVacant, TotalUnit,  pctWhite, pctPoverty, vacant_land,pop_density, commercial_construction,job_density, financial_services, impervious_surface, house_size,house_density, single_family,  violent_crime, public_transportation, park_nn2,park_nn3,park_nn5, library_nn1,library_nn2,library_nn3,library_nn4,library_nn5, shopping_nn1, shopping_nn2,shopping_nn3, school_nn2,school_nn3, medical_nn3,medical_nn4,medical_nn5))

require(broom)
    
tidy(Model1)%>% kable() %>% kable_styling("striped", full_width = F)
term estimate std.error statistic p.value
(Intercept) -589.875784 11286.3099636 -0.0522647 0.9583181
heatedarea 117.742249 2.1799902 54.0104485 0.0000000
fullbaths 30518.304270 2398.0293332 12.7264099 0.0000000
bldggradeCUSTOM 1422819.145248 17399.0560938 81.7756514 0.0000000
bldggradeEXCELLENT 621760.392032 9500.8813240 65.4423912 0.0000000
bldggradeFAIR -27185.063311 9226.1121763 -2.9465351 0.0032156
bldggradeGOOD 56212.575404 3140.2232686 17.9008212 0.0000000
bldggradeMINIMUM -37023.688393 45443.6613239 -0.8147162 0.4152401
bldggradeVERY GOOD 207247.631392 5150.8209981 40.2358442 0.0000000
halfbaths 9786.691299 2434.8363058 4.0194453 0.0000585
bedrooms -14895.132035 1433.0037424 -10.3943427 0.0000000
shape_Area 1.354752 0.0351807 38.5083579 0.0000000
house_age 287.054390 64.0170322 4.4840315 0.0000073
MedHHInc 1.027735 0.0585072 17.5659519 0.0000000
TotalVacant 286.490631 18.6141373 15.3910238 0.0000000
TotalUnit -15.295481 2.8639178 -5.3407544 0.0000001
pctWhite 37461.480319 6808.2532560 5.5023629 0.0000000
pctPoverty 151705.386239 19279.3279406 7.8688109 0.0000000
vacant_land -57.666582 4.9446296 -11.6624675 0.0000000
pop_density -20.600938 2.9485742 -6.9867458 0.0000000
commercial_construction 2256.721799 157.5523662 14.3236300 0.0000000
job_density -10.955546 0.7722128 -14.1872108 0.0000000
financial_services 22.513701 3.6194575 6.2201867 0.0000000
impervious_surface 196.890334 19.0317241 10.3453756 0.0000000
house_size 40.008371 3.2892727 12.1632878 0.0000000
house_density 8.273438 7.1223532 1.1616158 0.2453992
single_family 32.016694 6.5905870 4.8579427 0.0000012
violent_crime -971.384101 109.3081394 -8.8866585 0.0000000
public_transportation 16.396322 3.2617575 5.0268366 0.0000005
park_nn2 31.229070 7.4517298 4.1908484 0.0000279
park_nn3 -21.767661 10.9520610 -1.9875401 0.0468701
park_nn5 -4.459184 5.6912329 -0.7835181 0.4333281
library_nn1 4.057601 1.3288621 3.0534401 0.0022640
library_nn2 -24.546828 4.7918202 -5.1226522 0.0000003
library_nn3 86.134111 8.8259271 9.7592139 0.0000000
library_nn4 -53.623143 6.5495622 -8.1872865 0.0000000
library_nn5 -25.952219 2.3449607 -11.0672296 0.0000000
shopping_nn1 -12.190330 4.4562668 -2.7355477 0.0062307
shopping_nn2 35.698107 10.1788744 3.5070780 0.0004536
shopping_nn3 -45.500832 7.7424419 -5.8768063 0.0000000
school_nn2 -65.467935 6.2483614 -10.4776167 0.0000000
school_nn3 77.334244 6.4120949 12.0606831 0.0000000
medical_nn3 145.049239 21.3493322 6.7940879 0.0000000
medical_nn4 -229.874447 43.5386823 -5.2797750 0.0000001
medical_nn5 100.560697 24.6402499 4.0811557 0.0000449
stargazer(Model1, type = "html",
          title = "Table 4.1.1 Summary Statistics of Model 1 ",
          header = FALSE,
          single.row = TRUE, align = F, notes.align = "c")
Table 4.1.1 Summary Statistics of Model 1
Dependent variable:
price
heatedarea 117.742*** (2.180)
fullbaths 30,518.300*** (2,398.029)
bldggradeCUSTOM 1,422,819.000*** (17,399.060)
bldggradeEXCELLENT 621,760.400*** (9,500.881)
bldggradeFAIR -27,185.060*** (9,226.112)
bldggradeGOOD 56,212.570*** (3,140.223)
bldggradeMINIMUM -37,023.690 (45,443.660)
bldggradeVERY GOOD 207,247.600*** (5,150.821)
halfbaths 9,786.691*** (2,434.836)
bedrooms -14,895.130*** (1,433.004)
shape_Area 1.355*** (0.035)
house_age 287.054*** (64.017)
MedHHInc 1.028*** (0.059)
TotalVacant 286.491*** (18.614)
TotalUnit -15.295*** (2.864)
pctWhite 37,461.480*** (6,808.253)
pctPoverty 151,705.400*** (19,279.330)
vacant_land -57.667*** (4.945)
pop_density -20.601*** (2.949)
commercial_construction 2,256.722*** (157.552)
job_density -10.956*** (0.772)
financial_services 22.514*** (3.619)
impervious_surface 196.890*** (19.032)
house_size 40.008*** (3.289)
house_density 8.273 (7.122)
single_family 32.017*** (6.591)
violent_crime -971.384*** (109.308)
public_transportation 16.396*** (3.262)
park_nn2 31.229*** (7.452)
park_nn3 -21.768** (10.952)
park_nn5 -4.459 (5.691)
library_nn1 4.058*** (1.329)
library_nn2 -24.547*** (4.792)
library_nn3 86.134*** (8.826)
library_nn4 -53.623*** (6.550)
library_nn5 -25.952*** (2.345)
shopping_nn1 -12.190*** (4.456)
shopping_nn2 35.698*** (10.179)
shopping_nn3 -45.501*** (7.742)
school_nn2 -65.468*** (6.248)
school_nn3 77.334*** (6.412)
medical_nn3 145.049*** (21.349)
medical_nn4 -229.874*** (43.539)
medical_nn5 100.561*** (24.640)
Constant -589.876 (11,286.310)
Observations 36,153
R2 0.698
Adjusted R2 0.697
Residual Std. Error 202,421.100 (df = 36108)
F Statistic 1,893.992*** (df = 44; 36108)
Note: p<0.1; p<0.05; p<0.01

4.2 Training Set Summary Results

The summary of R-squared and adjusted R-squared for the training set are demonstrated in the following table. The current model reached 0.697 R-squared, indicating that 69.7% of the variance for of the dependent variable can be explained by this prediction.

Mecklenburg.model <-
  Model1

broom::glance(Mecklenburg.model) %>%
  kable(caption = 'Table 4.2 Summary of Model Evaluation Parameters') %>%
  kable_styling("striped", full_width = F)
Table 4.2 Summary of Model Evaluation Parameters
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.6976985 0.6973301 202421.1 1893.992 0 44 -492997.5 986087.1 986477.9 1479500623083658 36108 36153
Musa.result <-
  Mecklenburg.challenge %>%
  mutate(prediction = predict(Mecklenburg.model, Mecklenburg.challenge)) %>%
  select(musaID, prediction) %>%
  st_drop_geometry()

write.csv(Musa.result,"musa_prediction.csv", row.names = FALSE)

4.3 Test Set Summary Results

For the test set which occupied 20% of the Modelling set, it shows that this model is not so accurate. The R-squared is about 70%, MAE is 104032.5, MAPE is 0.27.

Mecklenburg.train <-
  Mecklenburg.train %>%
  mutate(Price.Predict = predict(Mecklenburg.model, Mecklenburg.train),
         Price.Error = price - Price.Predict,
         Price.AbsError = abs(price - Price.Predict),
         Price.APE = (abs(price - Price.Predict))/price)

Mecklenburg.test <-
  Mecklenburg.test %>%
  mutate(Regression = "Baseline Regression",
         Price.Predict = predict(Mecklenburg.model, Mecklenburg.test),
         Price.Error = price - Price.Predict,
         Price.AbsError = abs(price - Price.Predict),
         Price.APE = (abs(price - Price.Predict))/price)

Mecklenburg.test$Price.APE[is.infinite(Mecklenburg.test$Price.APE)] <- NA

Mecklenburg.test.rsq <-
  cor(Mecklenburg.test$price, Mecklenburg.test$Price.Predict, method = "pearson", use = "complete.obs") ^ 2

Mecklenburg.test.mae <-
  mean(Mecklenburg.test$Price.AbsError, na.rm = T)

Mecklenburg.test.mape <-
  mean(Mecklenburg.test$Price.APE, na.rm = T)

Mecklenburg.test.results<-as.data.frame(
  cbind(Mecklenburg.test.rsq,
        Mecklenburg.test.mae,
        Mecklenburg.test.mape))
  
colnames(Mecklenburg.test.results)<-c("R Square", "Mean Absolute Errors (MAE)","MAPE")

kable(Mecklenburg.test.results,
      caption = 'Table 4.3 R Square, MAE and MAPE for Test Set') %>%
  kable_styling("striped", full_width = F)
Table 4.3 R Square, MAE and MAPE for Test Set
R Square Mean Absolute Errors (MAE) MAPE
0.7012363 104032.5 0.2738515
as.data.frame(Mecklenburg.test) %>%
  dplyr::select(price, Price.Predict) %>%
  gather(Variable, Value) %>%
  ggplot(aes(Value, fill = Variable)) + 
  geom_density(alpha = 0.5) + 
  scale_fill_manual(values = c("#FA7070", "#5F9DF7")) +
  labs(title="Distribution of price & price predictions", 
       x = "Price/Prediction", y = "Density of observations") +
  plotTheme()

4.4 Cross-Validation

This section conducted Cross-Validation to assess how our model would generalize to other independent data sets.

Below is a summary table and distribution plots of R-square, RMSE, and MAE in each of 100 fold are shown below.

The variation of 100 folds can also be visualized with a histogram of across-fold MAE. The MAE in this case is a little bit high, most of them are distributed in the range of 95,000-115,000.

fitControl <- trainControl(method = "cv",
                           number = 100,
                           savePredictions = TRUE)

set.seed(99526)

Mecklenburg.CV <-
  train(price ~ ., data = as.data.frame(Mecklenburg.train)%>%
               dplyr::select(price,heatedarea, fullbaths, bldggrade,  halfbaths, bedrooms, shape_Area, house_age,  MedHHInc, TotalVacant, TotalUnit,  pctWhite, pctPoverty, vacant_land,pop_density, commercial_construction,job_density, financial_services, impervious_surface, house_size,house_density, single_family,  violent_crime, public_transportation, park_nn2,park_nn3,park_nn5, library_nn1,library_nn2,library_nn3,library_nn4,library_nn5, shopping_nn1, shopping_nn2,shopping_nn3, school_nn2,school_nn3, medical_nn3,medical_nn4,medical_nn5), 
     method = "lm", trControl = fitControl, na.action = na.pass)

##  show RMSE, MAE, R^2
kable(Mecklenburg.CV$resample,
          caption = 'Table 4.4 Cross-validation Test: Summary of RMSE, R-Squared and MAE') %>%
  kable_styling("striped", full_width = F)
Table 4.4 Cross-validation Test: Summary of RMSE, R-Squared and MAE
RMSE Rsquared MAE Resample
132257.3 0.8063728 84820.55 Fold001
161176.3 0.8245584 91627.30 Fold002
143440.5 0.7856015 98463.79 Fold003
129314.0 0.8260462 88971.01 Fold004
255219.4 0.7656121 120821.66 Fold005
197422.8 0.6718350 103693.93 Fold006
276923.7 0.6130969 111500.39 Fold007
232732.6 0.6858177 116387.19 Fold008
247954.3 0.7128319 115563.49 Fold009
197185.3 0.8088958 103750.77 Fold010
204485.9 0.6986820 107799.66 Fold011
170709.9 0.8294433 98909.21 Fold012
140775.5 0.7804158 97487.33 Fold013
122332.7 0.8092515 90851.77 Fold014
360176.5 0.4231420 125185.84 Fold015
242608.7 0.5891597 113806.27 Fold016
210585.3 0.7378107 120795.35 Fold017
225945.1 0.5437389 107771.93 Fold018
180223.2 0.6920633 113771.32 Fold019
211713.5 0.8215357 107660.98 Fold020
158765.8 0.7371448 96501.41 Fold021
130025.8 0.8043685 90157.75 Fold022
150926.9 0.7956205 91550.51 Fold023
155582.8 0.7060099 99306.53 Fold024
186902.5 0.6279338 94974.17 Fold025
261224.4 0.7198698 116014.06 Fold026
150088.2 0.8106802 99277.38 Fold027
174647.5 0.8120858 110684.95 Fold028
294016.2 0.5557784 109909.05 Fold029
165277.8 0.7019629 106194.40 Fold030
405842.6 0.2658657 123017.19 Fold031
153707.2 0.7988618 100984.49 Fold032
160859.7 0.7394718 106457.13 Fold033
210592.8 0.6864077 107114.18 Fold034
144080.0 0.7654732 92343.26 Fold035
306485.3 0.7187494 112315.66 Fold036
206330.2 0.5911698 106766.14 Fold037
272401.3 0.6797514 111245.84 Fold038
213143.4 0.6634821 107959.97 Fold039
188989.3 0.6222590 102960.25 Fold040
182165.7 0.8131559 102326.44 Fold041
162554.0 0.7890582 99603.82 Fold042
174860.4 0.7662719 110612.53 Fold043
201125.9 0.6368694 107114.37 Fold044
175263.7 0.8202691 106682.15 Fold045
136964.7 0.8418540 89897.38 Fold046
154284.3 0.6999854 96882.70 Fold047
209183.1 0.5596055 102941.48 Fold048
183263.8 0.8233870 108672.04 Fold049
141719.2 0.7509117 90963.34 Fold050
279560.3 0.7047242 113589.62 Fold051
301293.0 0.5628124 127806.95 Fold052
160618.5 0.8091244 95233.29 Fold053
186601.0 0.7545888 107837.78 Fold054
186333.6 0.6383631 106536.79 Fold055
421772.6 0.3875774 123887.24 Fold056
221857.1 0.5784830 102803.71 Fold057
143340.0 0.7732882 94426.81 Fold058
142584.0 0.6887276 88806.77 Fold059
165833.7 0.7361380 95620.19 Fold060
160261.6 0.8014695 104766.62 Fold061
249866.6 0.6059430 110747.92 Fold062
207968.0 0.8085016 109472.87 Fold063
161370.0 0.7304463 100396.74 Fold064
144151.5 0.8109968 97604.23 Fold065
207927.4 0.7516596 107100.11 Fold066
171202.7 0.6676488 100866.86 Fold067
127927.8 0.7162603 90090.60 Fold068
201288.7 0.6709075 102839.46 Fold069
175163.2 0.7440159 98329.05 Fold070
360557.1 0.3425005 113116.91 Fold071
182137.0 0.6813015 101403.91 Fold072
165871.2 0.7328143 98821.93 Fold073
125076.4 0.7147360 92119.64 Fold074
175945.2 0.7866469 96858.68 Fold075
206174.7 0.7391501 116005.52 Fold076
147059.0 0.7775983 98824.30 Fold077
166975.2 0.7966021 95503.65 Fold078
170504.3 0.6983789 96905.48 Fold079
140757.6 0.7760661 90882.39 Fold080
139492.8 0.7807296 93847.61 Fold081
132556.4 0.8508215 91322.44 Fold082
175608.0 0.8201767 101951.87 Fold083
192816.9 0.8012079 95045.92 Fold084
152448.5 0.7896580 90708.35 Fold085
170354.3 0.7492709 103824.44 Fold086
169118.5 0.7430909 101156.30 Fold087
239328.8 0.7751722 107760.72 Fold088
381995.9 0.4821003 110335.99 Fold089
146055.7 0.8678784 100494.46 Fold090
175482.8 0.6861038 98360.95 Fold091
177243.9 0.7426306 100420.13 Fold092
186142.0 0.7794631 109381.21 Fold093
243245.1 0.6360583 108347.35 Fold094
200240.9 0.6826500 111066.18 Fold095
178125.8 0.7755755 104399.46 Fold096
158678.7 0.8112389 93617.09 Fold097
172738.6 0.7000035 105153.28 Fold098
166553.7 0.7018630 100582.40 Fold099
131337.1 0.8126508 87196.65 Fold100
# plot MAE distribution
ggplot() +
  geom_histogram(data = Mecklenburg.CV$resample, aes(MAE), bins = 50, fill = "orange") +
  labs(title="Distribution of MAE",
       subtitle = "k-fold cross validation; k = 100",
       x="Mean Absolute Error", y="Count") +
  plotTheme()

4.5 Comparison Between Predicted Price and Observed Price

The Figure below demonstrated the predicted prices as a function of observed price, indicating that there are still some errors existing between the regression line (shown in blue) and the reference line (shown in orange). Both the training and test sets are in the same situation.

## Merge Predicted and Observed Price

Mecklenburg.train$Source <- "Training set"
Mecklenburg.test$Source <- "Test set"

Mecklenburg.alldata <-
  rbind(Mecklenburg.train, Mecklenburg.test %>%
          dplyr::select(-Regression))

ggplot(Mecklenburg.alldata, aes(x = price, y = Price.Predict, color = Source)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  geom_abline(color = "orange") +
  labs(title = "Predicted Prices as a Function of Observed Prices",
       subtitle = 'Blue line is model regression line\nwhile orange line is reference line\n',
       caption = '',
       x = "Observed Prices",
       y = "Predicted Prices") +
  theme() +
  plotTheme()

## By sets

ggplot(Mecklenburg.alldata, aes(x = price, y = Price.Predict, color = Source)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  geom_abline(color = "orange") +
  coord_equal() +
  theme_bw() +
  facet_wrap(~Source, ncol = 2) +
  labs(title = "Predicted Prices as a Function of Observed Prices, by Sets",
       subtitle = 'Blue line is model regression line while orange line is reference line\n',
       caption = '',
       x = "Observed Prices",
       y = "Predicted Prices") +
  theme() +
  plotTheme()

4.6 Map of Test Set’s Residuals

This section used three different tools to examine if prediction errors display certain spatial patterns.

  1. Mapping residuals: The maps below shows that most of the errors are clustering in the central and southern Mecklenburg.

  2. The spatial lag in errors: A spatial lag is a variable that averages the neighboring values of a location. It shows that as house price errors increase, the nearby house price errors slightly increase.

  3. Moran’s I: Moran’s I is a measure of the overall spatial autocorrelation. In general, when Moran’s I > 0, it indicates positive spatial correlation. The larger the value, the more obvious the spatial correlation. When Moran’s I < 0, it indicates negative spatial correlation, the smaller the value, the larger the spatial difference, otherwise, When Moran’s I = 0, the space is random.

ggplot() +
  geom_sf(data=st_union(Mecklenburg.tracts20)) +
  geom_sf(data = Mecklenburg.test, aes(color = q5(Price.Error)), show.legend = "point",size = 0.8) +
  scale_color_manual(values = alpha(new_palette5, .6),
                     labels = qBr(Mecklenburg.test,'Price.Error'),
                     name = "Residuals") +
  labs(title = "Map of residuals on test set\n") +
  mapTheme() +
  plotTheme()

ggplot() +
  geom_sf(data=st_union(Mecklenburg.tracts20)) +
  geom_sf(data = Mecklenburg.test, aes(color = q5(Price.AbsError)), show.legend = "point",size = 0.8) +
  scale_color_manual(values = alpha(new_palette5, .6),
                     labels = qBr(Mecklenburg.test,'Price.AbsError'),
                     name = "Absolute Errors") +
  labs(title = "Map of absolute errors on test set\n") +
  mapTheme() +
  plotTheme()

The plot below is the spatial lag in errors.

# Average Errors in Five Nearest Neighbors

Mecklenburg.test_predict <- Mecklenburg.test[which(Mecklenburg.test$Price.Error != 0),]

coords.test <- Mecklenburg.test_predict %>%
  st_coordinates()

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")

Mecklenburg.test_predict$lagPriceError <- lag.listw(spatialWeights.test, Mecklenburg.test_predict$Price.Error)
  
ggplot(Mecklenburg.test_predict, aes(x = lagPriceError, y = Price.Error)) +
  geom_point(color = "black") +
  geom_smooth(method = "lm", se = FALSE, colour = "darkred") +
  labs(title = "Error on Test Set as a function of the Spatial Lag of Price\n",
       caption = "",
       x = "Spatial lag of errors (Mean error of 5 nearest neighbors)",
       y = "Errors of Sale Price") +
  plotTheme()

In our model, the Moran’s I is 0.19, which means our model is randomly distributed to some extend.

moranTest <- moran.mc(Mecklenburg.test_predict$Price.Error, 
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()

4.7 Map of Predicted Prices Using the Whole Dataset

Applying the model to the whole dataset, the distribution of price in different set is shown in the map below. Housing prices are significant higher in the central, southern area of Mecklenburg, and a few northern areas. There is no obvious difference between the pattern of the distribution of price predictions in the two datasets.

Mecklenburg.modelling$Price.Predict <- predict(Mecklenburg.model, newdata = Mecklenburg.modelling)


Mecklenburg.challenge$Price.Predict <- predict(Mecklenburg.model, Mecklenburg.challenge)


ggplot() +
  geom_sf(data=st_union(Mecklenburg.tracts20)) +
  geom_sf(data = Mecklenburg.modelling, aes(color = q5(Price.Predict)), show.legend = "point",size = 0.8) +
  scale_color_manual(values = alpha(new_palette5, .6),
                     labels = qBr(Mecklenburg.modelling,'Price.Predict'),
                     name = "Price, $") +
  labs(title = 'Map of Predicted Price in Modelling set\n',
       subtitle = '',
       caption = '') +
  plotTheme() +
  mapTheme()

ggplot() +
  geom_sf(data=st_union(Mecklenburg.tracts20)) +
  geom_sf(data = Mecklenburg.challenge, aes(color = q5(Price.Predict)), show.legend = "point",size = 2) +
  scale_color_manual(values = alpha(new_palette5, .6),
                     labels = qBr(Mecklenburg.challenge,'Price.Predict'),
                     name = "Price, $") +
  labs(title = 'Map of Predicted Price in Challenge set\n',
       subtitle = '',
       caption = '') +
  plotTheme() +
  mapTheme()

4.8 Map of MAPE by Neighborhood

The map below is the mean absolute percentage error (MAPE) by neighborhood of the test set predictions.

Mecklenburg.nhood <- Mecklenburg.test %>% 
  group_by(id) %>%
  summarize(meanPrice = mean(price, na.rm = T),
            meanPrediction = mean(Price.Predict, na.rm = T),
            meanMAE = mean(Price.AbsError, na.rm = T),
            meanMAPE = mean(Price.APE, na.rm = T)) %>%
  st_drop_geometry %>%
  left_join(Mecklenburg.neighborhood,by = 'id') %>%
  dplyr::select(-X2020)

# kable(Mecklenburg.nhood) %>% 
#   kable_styling("striped", full_width = F)

Mecklenburg.nhood <-
  Mecklenburg.nhood %>%
  st_sf()

warm_palette5 <- c("#FBDA91", "#FB9E91", "#FB8691", "#FB0091", "#D2001A")

ggplot() + 
  geom_sf(data = Mecklenburg.nhood, aes(fill = q5(meanMAPE))) +
  scale_fill_manual(values = new_palette5,
                    labels = qBr(Mecklenburg.nhood, "meanMAPE", rnd = F),
                    name="Quintile\nBreaks") +
  geom_sf(data = Mecklenburg.test, aes(color = q5(Price.Predict)),show.legend = "point", size = .5) +
  scale_color_manual(values = alpha(warm_palette5, .5),
                     labels = qBr(Mecklenburg.test,'Price.Predict'),
                     name = "Price, $") +
  labs(title = "Mean Test Set MAPE by Neighborhood") +
  mapTheme()

4.9 Plot MAPE by Neighborhood

The plot below is the MAPE by neighborhood as a function of mean price by neighborhood.

ggplot()+
  geom_point(data = Mecklenburg.nhood, aes(x = meanPrice, y = meanMAPE), color = "lightblue") +
  stat_smooth(data = Mecklenburg.nhood, aes(x = meanPrice, y = meanMAPE), method = "loess", se = FALSE, size = 1, colour="red") + 
  labs(title = "MAPE by Neighborhood as a Function of\nMean Price by Neighborhood\n",
       caption = '') +
  xlab('Mean Price by Neighborhood') +
  ylab('Mean MAPE by Neighborhood') +
  plotTheme()

4.10 Model’s Generalizability Test

We have split the study area into four groups, according to race, income, vacant and renter occupied. The table results are as follows. Majority White and high income areas have more accurate prediction results. The maximum variation within the same group reached 5% in the race and renter group. The other two groups had little variation, which means our model is reliable in generalizability to some extend.

Mecklenburg.tracts20 <-
  Mecklenburg.tracts20 %>%
  mutate(raceContext = ifelse(pctWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(MedHHInc > mean(MedHHInc,na.rm = T), "High Income", "Low Income"),
         vacantContext = ifelse(pctTotalVacant > .5, "Majority Vacant", "Majority Non-Vacant"),
         pctRenterContext = ifelse(pctRenterOccupied > .5, "Majority Renter Occupied", "Majority Non-Renter Occupied"))

grid.arrange(ncol = 2,
  ggplot() + geom_sf(data = na.omit(Mecklenburg.tracts20), aes(fill = raceContext)) +
    scale_fill_manual(values = c("#85F4FF", "#FB9E91"), name="Race Context") +
    labs(title = "Race Context") +
    mapTheme() + theme(legend.position="bottom"), 
  ggplot() + geom_sf(data = na.omit(Mecklenburg.tracts20), aes(fill = incomeContext)) +
    scale_fill_manual(values = c("#85F4FF", "#FB9E91"), name="Income Context") +
    labs(title = "Income Context") +
    mapTheme() + theme(legend.position="bottom"))

grid.arrange(ncol = 2,
  ggplot() + geom_sf(data = na.omit(Mecklenburg.tracts20), aes(fill = vacantContext)) +
    scale_fill_manual(values = c("#85F4FF", "#FB9E91"), name="Vacant Context") +
    labs(title = "Vacant Context") +
    mapTheme() + theme(legend.position="bottom"), 
  ggplot() + geom_sf(data = na.omit(Mecklenburg.tracts20), aes(fill = pctRenterContext)) +
    scale_fill_manual(values = c("#85F4FF", "#FB9E91"), name="pctRenter Context") +
    labs(title = "pctRenter Context") +
    mapTheme() + theme(legend.position="bottom"))

Mecklenburg.race <- st_join(Mecklenburg.train, Mecklenburg.tracts20) %>%
  group_by(raceContext) %>%
  summarize(meanMAPE = mean(Price.APE, na.rm = T)) %>%
  st_drop_geometry() %>%
  spread(raceContext, meanMAPE) %>%
  kable(caption = "Table 4.10.1\nTest set MAPE by Neighborhood Racial Context") %>% 
  kable_styling("striped", full_width = F)

Mecklenburg.race
Table 4.10.1 Test set MAPE by Neighborhood Racial Context
Majority Non-White Majority White
0.2852243 0.2372819
Mecklenburg.income <- st_join(Mecklenburg.train, Mecklenburg.tracts20) %>%
  group_by(incomeContext) %>%
  summarize(meanMAPE = mean(Price.APE, na.rm = T)) %>%
  st_drop_geometry() %>%
  spread(incomeContext, meanMAPE) %>%
  kable(caption = "Table 4.10.2\nTest set MAPE by Neighborhood Income Context") %>% 
  kable_styling("striped", full_width = F)
Mecklenburg.income
Table 4.10.2 Test set MAPE by Neighborhood Income Context
High Income Low Income
0.2417418 0.2775655
Mecklenburg.vacant <- st_join(Mecklenburg.train, Mecklenburg.tracts20) %>%
  group_by(vacantContext) %>%
  summarize(meanMAPE = mean(Price.APE, na.rm = T)) %>%
  st_drop_geometry() %>%
  spread(vacantContext, meanMAPE) %>%
  kable(caption = "Table 4.10.3\nTest set MAPE by Neighborhood Vacant Context") %>% 
  kable_styling("striped", full_width = F)

Mecklenburg.vacant
Table 4.10.3 Test set MAPE by Neighborhood Vacant Context
Majority Non-Vacant Majority Vacant
0.2332548 0.2602928
Mecklenburg.rent <- st_join(Mecklenburg.train, Mecklenburg.tracts20) %>%
  group_by(pctRenterContext) %>%
  summarize(meanMAPE = mean(Price.APE, na.rm = T)) %>%
  st_drop_geometry() %>%
  spread(pctRenterContext, meanMAPE) %>%
  kable(caption = "Table 4.10.4\nTest set MAPE by Neighborhood Renter Occupied Context") %>% 
  kable_styling("striped", full_width = F)
Mecklenburg.rent
Table 4.10.4 Test set MAPE by Neighborhood Renter Occupied Context
Majority Non-Renter Occupied Majority Renter Occupied
0.2413728 0.2907168

5. Disscussion

In general, the model is effective in predicting house price. The results shows that the R-squared reached 0.7, which means the feature we selected in the model can explain 70% variation of house prices. However, for the generalizability, it performs not very well. The MAE is up to 104032.5 and MAPE is 0.2738515. It might because that we didn’t consider neighborhood effects and also some feature might fit to some region but not suitable for other places. Therefore, when using new data to test our model, it may not perform very well.

There are some interesting variables which can be considered useful by common sense but actually not. These features are like parks and employments.

We believe that the intrinsic characteristics of the house, such as heated area and number of bedrooms, generally correlate well with price. Then, we consider the properties of the house itself as important features.

The average difference between our price forecasts and actual prices is about 104,032.5. and most of the errors occur in the central and southern parts of Mecklenburg. In addition, the higher the actual price, the larger the error.

Prices in central Mecklenburg are relatively lower than in other regions, especially in the north and south. We speculate that this is related to race and income.

The model performs well in the marginal part of Mecklenburg, especially in the western part, and poorly in the middle part of Mecklenburg. The reason for this is that some features may be important for areas with large errors, but we did not include them in the model. Therefore, the model does not predict these areas well. As for those regions with smaller errors, the features we chose for the model may be just appropriate.

6. Conclusion

Based on our findings, we would recommend our model to Zillow, but I would definitely add some important caveats - our findings can only be used as an auxiliary reference, not as a key determinant, as it is poorly accurate. There are several areas that we need to improve in the future Zillow’s use:

  • We can use more advanced feature engineering methods to transform variables, such as logarithms or power functions, to make them perform better in OLS models.

  • Add some uncovered indicators to the model, especially those that help predict areas of high house prices.

  • Try other types of regression instead of OLS to reduce the error caused by non-linear variables. And if we can, we could apply Deep Learning method to fit our model, which will help a lot.

In this way, we believe our model can provide Zillow a better prediction of house price in the future.

7. References

[1]Wang Y. House-price Prediction Based on OLS Linear Regression and Random Forest[C]//2021 2nd Asia Service Sciences and Software Engineering Conference. 2021: 89-93.